19 research outputs found

    Characterizing in-text citations in scientific articles: A large-scale analysis.

    We report characteristics of in-text citations in over five million full text articles from two large databases – the PubMed Central Open Access subset and Elsevier journals – as functions of time, textual progression, and scientific field. The purpose of this study is to understand the characteristics of in-text citations in a detailed way prior to pursuing other studies focused on answering more substantive research questions. As such, we have analyzed in-text citations in several ways and report many findings here. Perhaps most significantly, we find that there are large field-level differences that are reflected in position within the text, citation interval (or reference age), and citation counts of references. In general, the fields of Biomedical and Health Sciences, Life and Earth Sciences, and Physical Sciences and Engineering have similar reference distributions, although they vary in their specifics. The two remaining fields, Mathematics and Computer Science and Social Science and Humanities, have different reference distributions from the other three fields and between themselves. We also show that in all fields the numbers of sentences, references, and in-text mentions per article have increased over time, and that there are field-level and temporal differences in the numbers of in-text mentions per reference. A final finding is that references mentioned only once tend to be much more highly cited than those mentioned multiple times.
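    Counting in-text mentions per reference, as in the abstract's final finding, can be sketched as follows. This is a simplified illustration assuming bracketed numeric citations (e.g. `[1]` or `[1,2]`); the function name and regular expression are hypothetical and do not reflect the study's actual extraction pipeline.

    ```python
    import re
    from collections import Counter

    def mention_counts(text):
        """Count in-text mentions per reference for bracketed numeric
        citations such as [1] or [1,4]; a toy sketch, not the paper's
        extraction method."""
        counts = Counter()
        # Find bracketed groups of digits, commas, and spaces.
        for group in re.findall(r"\[([\d,\s]+)\]", text):
            for ref in group.split(","):
                ref = ref.strip()
                if ref:
                    counts[ref] += 1
        return counts

    sample = "Prior work [1] showed X [1,2]; later studies [3] extended it."
    print(dict(mention_counts(sample)))  # {'1': 2, '2': 1, '3': 1}
    ```

    Reference 1 is mentioned twice here; the study's finding is that such multiply-mentioned references tend to be less highly cited than those mentioned only once.
    
    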

    A principled methodology for comparing relatedness measures for clustering publications

    There are many different relatedness measures, based for instance on citation relations or textual similarity, that can be used to cluster scientific publications. We propose a principled methodology for evaluating the accuracy of clustering solutions obtained using these relatedness measures. We formally show that the proposed methodology has an important consistency property. The empirical analyses that we present are based on publications in the fields of cell biology, condensed matter physics, and economics. Using the BM25 text-based relatedness measure as the evaluation criterion, we find that bibliographic coupling relations yield more accurate clustering solutions than direct citation relations and co-citation relations. The so-called extended direct citation approach performs similarly to or slightly better than bibliographic coupling in terms of the accuracy of the resulting clustering solutions. Conversely, using a citation-based relatedness measure as the evaluation criterion, BM25 turns out to yield more accurate clustering solutions than other text-based relatedness measures.
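    The BM25 measure mentioned above scores how related one document's terms are to another document. A minimal sketch of document-document BM25, assuming pre-tokenized documents; the parameter values and exact weighting here are the textbook defaults, not necessarily those used in the paper:

    ```python
    import math
    from collections import Counter

    def bm25_relatedness(docs, i, j, k1=1.2, b=0.75):
        """Score doc j against the terms of doc i with Okapi BM25.
        docs is a list of token lists; a minimal sketch only."""
        N = len(docs)
        avgdl = sum(len(d) for d in docs) / N
        # Document frequency of each term.
        df = Counter()
        for d in docs:
            for t in set(d):
                df[t] += 1
        tf_j = Counter(docs[j])
        score = 0.0
        for t in set(docs[i]):
            if t not in tf_j:
                continue
            # Smoothed IDF (kept positive by the +1 inside the log).
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            f = tf_j[t]
            score += idf * f * (k1 + 1) / (
                f + k1 * (1 - b + b * len(docs[j]) / avgdl))
        return score

    docs = [["citation", "analysis", "network"],
            ["citation", "analysis", "text"],
            ["economics", "growth"]]
    print(bm25_relatedness(docs, 0, 1))  # shared terms: positive score
    print(bm25_relatedness(docs, 0, 2))  # no shared terms: 0.0
    ```

    Documents sharing rarer terms score higher, which is what makes BM25 usable as a relatedness criterion for clustering evaluation.
    
    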

    Design and update of a classification system: The UCSD map of science

    Global maps of science can be used as a reference system to chart career trajectories, the location of emerging research frontiers, or the expertise profiles of institutes or nations. This paper details the data preparation, analysis, and layout performed when designing and subsequently updating the UCSD map of science and classification system. The original classification and map use 7.2 million papers and their references from Elsevier's Scopus (about 15,000 source titles, 2001-2005) and Thomson Reuters' Web of Science (WoS) Science, Social Science, and Arts & Humanities Citation Indexes (about 9,000 source titles, 2001-2004), together about 16,000 unique source titles. The updated map and classification add six years (2005-2010) of WoS data and three years (2006-2008) from Scopus to the existing category structure, increasing the number of source titles to about 25,000. To our knowledge, this is the first time that a widely used map of science has been updated. A comparison of the original 5-year and the new 10-year maps and classification systems shows (i) an increase of 9,409 in the total number of journals that can be mapped (social sciences had an 80% increase, humanities 119%, medical 32%, and natural sciences 74%), (ii) a simplification of the map by assigning all but five highly interdisciplinary journals to exactly one discipline, (iii) a more even distribution of journals over the 554 subdisciplines and 13 disciplines when calculating the coefficient of variation, and (iv) a better reflection of journal clusters when compared with paper-level citation data. When evaluating the map against a listing of desirable features for maps of science, the updated map is shown to have higher mapping accuracy, easier understandability as fewer journals are multiply classified, and higher usability for the generation of data overlays, among others.
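    The coefficient of variation used in point (iii) to assess how evenly journals are distributed over subdisciplines is simple to compute; a minimal sketch with hypothetical journal counts (the function name and sample data are illustrative):

    ```python
    import statistics

    def coefficient_of_variation(journal_counts):
        """CV = population stdev / mean of journals per subdiscipline;
        lower values mean a more even distribution."""
        mean = statistics.mean(journal_counts)
        return statistics.pstdev(journal_counts) / mean

    even = [50, 50, 50, 50]      # journals spread evenly
    uneven = [5, 5, 5, 185]      # one subdiscipline dominates
    print(coefficient_of_variation(even))    # 0.0
    print(coefficient_of_variation(uneven))  # well above 1
    ```

    A lower CV after the update is what the authors mean by "a more even distribution of journals".
    
    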

    Accurately identifying topics using text: Mapping PubMed

    Recently, citation links have been shown to produce accurate delineations of tens of millions of scientific documents into a large number (~100,000) of clusters (Sjögårde & Ahlgren, 2018). Such clusters, which we refer to as topics, can be used for research evaluation and planning (Klavans & Boyack, 2017a) as well as to identify hot and/or emerging topics (Small, Boyack, & Klavans, 2014). While direct citation links have been shown to produce more accurate topics from large citation databases than co-citation or bibliographic coupling links (Klavans & Boyack, 2017b), no such comparison has been done at a similar scale using topics based on textual relatedness, due to the extreme computational requirements of calculating an enormous number of document-document similarities using text. Thus, we simply do not know if topics identified from a large database using textual characteristics are as accurate as those identified using direct citation. This paper aims to fill that gap. In this work we cluster over 23 million documents from the PubMed database (1975-2017) using a text-based similarity measure and compare the accuracy of the resulting topics with that of existing citation-based topics using three different measures.
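    Comparing a text-based clustering with a citation-based one requires some agreement measure between the two partitions of the same documents. The abstract does not name its three measures, so the Rand index below is only one simple stand-in, shown as a sketch (quadratic in the number of documents, so usable only far below the paper's 23-million-document scale):

    ```python
    from itertools import combinations

    def rand_index(labels_a, labels_b):
        """Fraction of document pairs on which two clusterings agree
        (both put the pair together, or both split it). A simple
        stand-in; not one of the paper's actual accuracy measures."""
        agree = total = 0
        for i, j in combinations(range(len(labels_a)), 2):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            agree += same_a == same_b
            total += 1
        return agree / total

    text_topics = [0, 0, 1, 1]      # hypothetical text-based labels
    citation_topics = [1, 1, 0, 0]  # same partition, relabeled
    print(rand_index(text_topics, citation_topics))  # 1.0
    ```

    Because the index only looks at pairwise co-membership, it is invariant to relabeling, which is essential when the two clusterings use unrelated cluster IDs.
    
    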

    What is the organizing principle for large topics?

    We know a great deal about how to identify the topics that researchers are working on. One can use citations and/or text to identify about one hundred thousand document clusters from either the Scopus database or the WoS database. For purposes of discussion, we refer to these document clusters as topics. In our models there are about a thousand very large topics and tens of thousands of small topics. But why doesn’t topic size follow an expected linear Zipfian distribution? Is it possible that there is a different organizing principle for large vs. small topics? In this study, we explore the possibility that the organizing principle for large topics is the continued use of very expensive tools (such as specialized equipment, specialized databases, and specialized software). For our initial exploration, we use grant size (NSF or NIH grants in excess of $5 million annually) as a proxy for preferential investment in specialized tools. Using links between 52,097 grants and tens of thousands of topics, we test whether large topics receive more funding than expected from large grants and, by inference, whether expensive tools are an organizing principle for large topics.
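    The "more funding than expected" test can be sketched as comparing the share of large-grant dollars going to large topics with those topics' share of total size. This is a schematic illustration only: the size cutoff, data shape, and function name are assumptions, not the study's actual statistical model.

    ```python
    def funding_ratio(topics, size_cutoff=1000):
        """topics: list of (topic_size, large_grant_dollars) pairs.
        Returns the ratio of large topics' funding share to their
        size share; > 1 means large topics receive more large-grant
        funding than proportional. Schematic sketch; the cutoff is
        a hypothetical parameter."""
        total_size = sum(s for s, _ in topics)
        total_funds = sum(f for _, f in topics)
        big = [(s, f) for s, f in topics if s >= size_cutoff]
        size_share = sum(s for s, _ in big) / total_size
        fund_share = sum(f for _, f in big) / total_funds
        return fund_share / size_share

    # One large topic holding 2/3 of documents but 80% of funding.
    sample = [(2000, 80), (500, 10), (500, 10)]
    print(funding_ratio(sample))  # 1.2: more funding than expected
    ```

    A ratio consistently above 1 across fields would support the hypothesis that preferential investment in expensive tools organizes large topics.
    
    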

    Galileo's stream: A framework for understanding knowledge production

    We introduce a framework for understanding knowledge production in which knowledge is produced in stages (along a research-to-development continuum) and in three discrete categories (science and understanding, tools and technology, and societal use and behavior), and in which knowledge in the various stages and categories is produced both non-interactively and interactively. The framework attempts to balance our experiences as working scientists and technologists, our best current understanding of the social processes of knowledge production, and the possibility of mathematical analyses. It offers a potential approach both to improving our basic understanding of the knowledge-production process and to developing tools for its enterprise management.